snapct: switch to tcp-based ping by ptaffet-jump · Pull Request #8657 · firedancer-io/firedancer

ptaffet-jump · 2026-03-03T04:27:43Z

Closes #8255
Closes #8662
Closes #8661

github-actions · 2026-03-03T04:31:13Z

Performance Measurements ⏳

Suite	Baseline	New	Change
backtest `mainnet-368528500-perf` per slot	`0.061128 s`	`0.060977 s`	`-0.247%` ✅
backtest `mainnet-368528500-perf` snapshot load	`3.087 s`	`2.374 s`	`-23.097%` ✅
backtest `mainnet-368528500-perf` total elapsed	`61.128287 s`	`60.977325 s`	`-0.247%` ✅
`firedancer mem` usage with `mainnet.toml`	`963.38 GiB`	`963.38 GiB`	`0.000%` ✅

Copilot

Pull request overview

This PR updates the snapshot pinger used by snapct to use TCP connection attempts (instead of ICMP) to estimate peer reachability/latency, and adjusts the snapct tile’s sandboxing configuration to accommodate the new ping mechanism.

Changes:

Replace ICMP echo handling in fd_ssping with a pool of nonblocking TCP sockets and poll()-based completion tracking.
Update snapct sandbox allowlists/seccomp policy to account for many ping fds (min/max fd range) and allow poll.
Adjust snapct fd budgeting and allowed-fd population to include all ping sockets.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
src/discof/restore/utils/fd_ssping.h	Adds `FD_SSPING_FD_CNT` and replaces single-socket API with multi-fd enumeration API.
src/discof/restore/utils/fd_ssping.c	Implements TCP-based “ping” via nonblocking `connect()` and `poll()`, plus fd pool bookkeeping.
src/discof/restore/generated/fd_snapct_tile_seccomp.h	Regenerates seccomp filter to allow `poll` and accept a ping-fd range (min/max).
src/discof/restore/fd_snapct_tile.seccomppolicy	Updates policy inputs and rules to account for multiple ping fds and `poll`.
src/discof/restore/fd_snapct_tile.c	Expands fd budgeting and allowlist population to include all ping fds; passes ping fd range into seccomp generator.

src/discof/restore/utils/fd_ssping.c

Copilot · 2026-03-03T04:36:04Z

src/discof/restore/utils/fd_ssping.c

+      deadline_list_ele_remove( peer->state==PEER_STATE_PINGED ? ssping->pinged : ssping->refreshing, peer, ssping->pool );
+      int is_err = ssping->used_fds[ i ].revents & (POLLRDHUP|POLLERR|POLLHUP);
+      if( FD_LIKELY( !is_err ) ) {
+        peer->latency_nanos  = (ulong)fd_long_max( now - (peer->deadline_nanos - PEER_DEADLINE_NANOS_PING), 1L );
+        peer->state          = PEER_STATE_VALID;
+        peer->deadline_nanos = now + PEER_DEADLINE_NANOS_VALID;
+        deadline_list_ele_push_tail( ssping->valid, peer, ssping->pool );
+
+        FD_LOG_INFO(( "pinged " FD_IP4_ADDR_FMT ":%hu in %lu nanos",
+              FD_IP4_ADDR_FMT_ARGS( peer->addr.addr ), fd_ushort_bswap( peer->addr.port ), peer->latency_nanos ));
+        ssping->on_ping_cb( ssping->cb_arg, peer->addr, peer->latency_nanos );
+      } else {


recv_pings() treats a socket as successfully "pinged" when poll revents does not include POLLERR/POLLHUP/POLLRDHUP, but for nonblocking connect() the reliable way to determine success/failure is to query SO_ERROR via getsockopt(SOL_SOCKET, SO_ERROR). Some connect failures can present as POLLOUT without POLLERR, which would incorrectly mark an unreachable peer as VALID and skew peer selection. Suggest: on POLLOUT, call getsockopt(SO_ERROR) and treat any nonzero error as invalid (and only compute latency / call on_ping_cb when SO_ERROR==0).

Copilot · 2026-03-03T04:36:04Z

src/discof/restore/utils/fd_ssping.c

  switch( peer->state ) {
    case PEER_STATE_UNPINGED:
      deadline_list_ele_remove( ssping->unpinged, peer, ssping->pool );
      break;
    case PEER_STATE_PINGED:
      deadline_list_ele_remove( ssping->pinged, peer, ssping->pool );
      break;
    case PEER_STATE_VALID:
      deadline_list_ele_remove( ssping->valid, peer, ssping->pool );
      break;
    case PEER_STATE_REFRESHING:
      deadline_list_ele_remove( ssping->refreshing, peer, ssping->pool );
      break;
    case PEER_STATE_INVALID:
      return;
  }
  peer->state = PEER_STATE_INVALID;
  peer->deadline_nanos = now + PEER_DEADLINE_NANOS_INVALID;
  deadline_list_ele_push_tail( ssping->invalid, peer, ssping->pool );


fd_ssping_invalidate() can transition a peer from PINGED/REFRESHING to INVALID without releasing its associated ping fd / used_fd_idx. With the new invariant that PINGED/REFRESHING peers are exactly those owning a ping fd, this leaves a stale entry in used_fds/ping_to_pool pointing at a peer that is no longer in those lists (and may later be reused by the pool), which can lead to incorrect behavior or use-after-free. When invalidating a peer that currently has a ping fd, reclaim the fd (e.g., via remove_fdesc_idx(peer->used_fd_idx)) before moving it to INVALID.

Copilot · 2026-03-03T04:36:04Z

src/discof/restore/fd_snapct_tile.c

+              1UL +                             /* logfile */
+              1UL;


rlimit_file_cnt() now adds an extra bare + 1UL with no comment or corresponding fd usage. This looks like an accidental overcount and makes it harder to audit the fd budget (especially now that FD_SSPING_FD_CNT is significant). Either remove this term or document exactly which file descriptor it is intended to account for.

Suggested change

1UL + /* logfile */

1UL;

1UL; /* logfile */

Copilot · 2026-03-03T04:36:04Z

src/discof/restore/fd_snapct_tile.c

                      ulong                  out_fds_cnt,
                      int *                  out_fds ) {
-  if( FD_UNLIKELY( out_fds_cnt<6UL ) ) FD_LOG_ERR(( "out_fds_cnt %lu", out_fds_cnt ));
+  if( FD_UNLIKELY( out_fds_cnt<255UL ) ) FD_LOG_ERR(( "out_fds_cnt %lu", out_fds_cnt ));


populate_allowed_fds() hard-codes a minimum capacity check of 255. This is tightly coupled to the current FD_SSPING_FD_CNT (249) and the current list of always/optionally included fds, and it will silently become wrong if any of those change. Consider expressing this in terms of FD_SSPING_FD_CNT plus the maximum number of non-ssping fds that can be appended in this function (and accounting for the optional logfile fd) so the bound stays correct as the code evolves.

ripatel-fd · 2026-03-03T15:11:05Z

@copilot code review[agent] Does this reliably clean up TCP connection state at the remote end? (RST/FIN sent to free up remote's state)

Copilot · 2026-03-03T15:11:14Z

@ripatel-fd I've opened a new pull request, #8664, to work on those changes. Once the pull request is ready, I'll request review from you.

ptaffet-jump · 2026-03-03T17:30:45Z

Based on a pcap, it seems that it does reliably clean up connection state. Scanning the linux source code confirms this:

It sends an RST when in the states the spec recommends doing so. In other states, it closes the socket, and when the host responds with a SYN+ACK, the kernel sends back an RST because there's no matching socket.

When you call close(), if there's no unread data, it initiates a TCP graceful shutdown, but if there is unread data, it initiates an abortive shutdown. Calling connect with AF_UNSPEC initiates an abortive shutdown in both cases.

snapct: switch to tcp-based ping

e13d6cc

Copilot AI review requested due to automatic review settings March 3, 2026 04:27

Copilot started reviewing on behalf of ptaffet-jump March 3, 2026 04:28 View session

Copilot AI reviewed Mar 3, 2026

View reviewed changes

Copilot AI mentioned this pull request Mar 3, 2026

ssping: use close+recreate instead of connect(AF_UNSPEC) for reliable TCP cleanup #8664

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

snapct: switch to tcp-based ping#8657

snapct: switch to tcp-based ping#8657
ptaffet-jump wants to merge 1 commit intomainfrom
ptaffet/snapshot-ping-tcp

ptaffet-jump commented Mar 3, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI Mar 3, 2026

Uh oh!

Copilot AI Mar 3, 2026

Uh oh!

Copilot AI Mar 3, 2026

Uh oh!

Copilot AI Mar 3, 2026

Uh oh!

ripatel-fd commented Mar 3, 2026

Uh oh!

Copilot AI commented Mar 3, 2026

Uh oh!

ptaffet-jump commented Mar 3, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ptaffet-jump commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Mar 3, 2026

Performance Measurements ⏳

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

ripatel-fd commented Mar 3, 2026

Uh oh!

Copilot AI commented Mar 3, 2026

Uh oh!

ptaffet-jump commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ptaffet-jump commented Mar 3, 2026 •

edited

Loading

ptaffet-jump commented Mar 3, 2026 •

edited

Loading